XML Information Retrieval Considering Physical Page Layout of Logical Elements

Authors

Toshiyuki Shimizu

Masatoshi Yoshikawa

Abstract

XML information retrieval (XML-IR) systems use the logical structure of XML documents to retrieve relevant elements. In practice, how the search results of an XML-IR system are displayed is also important. When the XML documents being searched were created by marking up documents originally composed of pages, such as scholarly articles or books, result elements should be overlaid on the physical page layout in the user interface. We propose such a display method for keyword searches on XML documents of scholarly articles, together with ranking methods based on page units. Because multiple result elements may appear on the same page, a ranking method different from simple element ranking is needed. The proposed ranking method considers both the benefit obtained from the result elements and the reading effort required to read the result elements and the nearby elements needed to understand them.
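The abstract does not give the scoring formula, so the following is only a minimal sketch of the page-unit idea under stated assumptions: benefit is taken to be the sum of the relevance scores of the result elements on a page, reading effort is taken to be the number of words to read (the elements plus nearby context), and pages are ordered by benefit per unit of effort. The `ResultElement` fields, the `rank_pages` function, and the sample data are all hypothetical and are not the authors' method.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ResultElement:
    """A retrieved XML element (all fields are hypothetical)."""
    path: str         # XPath-like identifier of the element
    page: int         # physical page the element is laid out on
    relevance: float  # score from some element-level ranker (assumed)
    length: int       # words to read, including nearby context (assumed)


def rank_pages(elements):
    """Group result elements by page and rank pages by benefit / effort."""
    by_page = defaultdict(list)
    for el in elements:
        by_page[el.page].append(el)

    scored = []
    for page, els in by_page.items():
        benefit = sum(el.relevance for el in els)  # assumed benefit model
        effort = sum(el.length for el in els)      # assumed effort model
        score = benefit / effort if effort > 0 else 0.0
        scored.append((page, score))
    # Highest benefit per unit of reading effort first; ties by page number.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


results = [
    ResultElement("/article/sec[1]/p[2]", page=1, relevance=0.9, length=120),
    ResultElement("/article/sec[1]/p[3]", page=1, relevance=0.3, length=80),
    ResultElement("/article/sec[2]/p[1]", page=2, relevance=0.8, length=50),
]
ranked_pages = rank_pages(results)
```

In this toy data, page 2 holds a single short, highly relevant element, so it ranks ahead of page 1 even though page 1 has the larger total relevance. A ratio is only one possible way to trade benefit against effort; a difference or a weighted combination would fit the same description.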

Similar resources

XML Retrieval

DEFINITION: Text documents often contain a mixture of structured and unstructured content. One way to format this mixed content is according to the adopted W3C standard for information repositories and exchanges, the eXtensible Mark-up Language (XML). In contrast to HTML, which is mainly layout-oriented, XML follows the fundamental concept of separating the logical structure of a document from i...

Query Relaxation by Structure and Semantics for Retrieval of Logical Web Documents

Since the WWW encourages hypertext and hypermedia document authoring (e.g. HTML or XML), Web authors tend to create documents that are composed of multiple pages connected with hyperlinks. A Web document may be authored in multiple ways, such as (1) all information in one physical page, or (2) a main page and the related information in separate linked pages. Existing Web search engines, however, re...

Relationships in Structured Text Retrieval

SYNONYM: None. DEFINITION: In structured text retrieval, the relationship between text components may be used in ranking components relative to a given query. MAIN TEXT: In a structured text document, there exists a relationship between the document components. In the context of XML retrieval, the relationships between elements are provided by the logical structure of the XML markup. An element, un...

Improved CHAID algorithm for document structure modelling

This paper proposes a technique for the logical labelling of document images. It makes use of a decision-tree based approach to learn and then recognise the logical elements of a page. A state-of-the-art OCR gives the physical features needed by the system. Each block of text is extracted during the layout analysis and raw physical features are collected and stored in the ALTO format. The data-...
